An Exploration of the redo log

Summary: this article examines how the redo log physically guarantees that page contents are never lost, walking through three stages: how an update is written into the mtr's redo log, how the mtr commits, and how the redo log is flushed to disk.


Overall redo log flow

Experiment setup

Goal: trace a complete redo log update, which breaks down into three stages:

  1. from the DML operation to the mini-transaction buffer;
  2. from the mini-transaction buffer to the redo log buffer;
  3. flushing the redo log_sys buffer to disk.
    Experiment code:
create table table_tredo (
    dept_no int,
    from_join_time int,
    dept_descript varchar(50),
    primary key (dept_no)
);
insert into table_tredo values (0, 2, 'huangxiaodong');

Writing the update into the mtr's redo log

Experiment

Code:
update table_tredo
set from_join_time = 4
where dept_no = 0;

Since this is an optimistic, in-place update, we trace a fragment in the middle of btr_cur_update_in_place:
is_hashed = (block->index != NULL);

if (is_hashed) {
	/* TO DO: Can we skip this if none of the fields
	index->search_info->curr_n_fields are being updated? */

	/* The function row_upd_changes_ord_field_binary works only
	if the update vector was built for a clustered index, we must
	NOT call it if index is secondary */

	if (!dict_index_is_clust(index)
	    || row_upd_changes_ord_field_binary(index, update, thr,
						NULL, NULL)) {
		/* Remove possible hash index pointer to this record */
		btr_search_update_hash_on_delete(cursor);
	}

	rw_lock_x_lock(btr_get_search_latch(index));
}

This is adaptive hash index code. Note the first line, is_hashed = (block->index != NULL): block->index is closely tied to the adaptive hash index, which is worth keeping in mind.

Which pages does the in-place update modify, and which records at which positions within them?

Here we describe what the update changes on the page. First, look at the fragment of btr_cur_update_in_place that updates the page and writes the redo log:
if (!(flags & BTR_KEEP_SYS_FLAG)
    && !dict_table_is_intrinsic(index->table)) {
	row_upd_rec_sys_fields(rec, NULL, index, offsets,
			       thr_get_trx(thr), roll_ptr);
}

/* other code */

row_upd_rec_in_place(rec, index, offsets, update, page_zip);

if (is_hashed) {
	rw_lock_x_unlock(btr_get_search_latch(index));
}

btr_cur_update_in_place_log(flags, rec, index, update,
			    trx_id, roll_ptr, mtr);

Here row_upd_rec_in_place does the actual in-place write of the update into the page, and the following btr_cur_update_in_place_log writes the redo log. First, what the update writes into the page:

  1. It updates the record's transaction ID and roll pointer.
  2. It updates the length and the data of each updated column in the record.
  3. As seen above, if the record is covered by an adaptive hash index and either the index is not a clustered index or the update changes the binary ordering of an indexed field, the hash index entry for the record may be removed.

    How the redo log is formed during update in place

    After the record is updated, the redo log is written via:

btr_cur_update_in_place_log(flags, rec, index, update, trx_id, roll_ptr, mtr);

Below we analyze this function in detail:

/***********************************************************//**
Writes a redo log record of updating a record in-place. */
void
btr_cur_update_in_place_log(
/*========================*/
ulint flags, /*!< in: flags */
const rec_t* rec, /*!< in: record */
dict_index_t* index, /*!< in: index of the record */
const upd_t* update, /*!< in: update vector */
trx_id_t trx_id, /*!< in: transaction id */
roll_ptr_t roll_ptr, /*!< in: roll ptr */
mtr_t* mtr) /*!< in: mtr */
{
byte* log_ptr;
const page_t* page = page_align(rec);
ut_ad(flags < 256);
ut_ad(!!page_is_comp(page) == dict_table_is_comp(index->table));
log_ptr = mlog_open_and_write_index(mtr, rec, index, page_is_comp(page)
? MLOG_COMP_REC_UPDATE_IN_PLACE
: MLOG_REC_UPDATE_IN_PLACE,
1 + DATA_ROLL_PTR_LEN + 14 + 2
+ MLOG_BUF_MARGIN);
if (!log_ptr) {
/* Logging in mtr is switched off during crash recovery */
return;
}
/* For secondary indexes, we could skip writing the dummy system
fields to the redo log but we have to change redo log parsing of
MLOG_REC_UPDATE_IN_PLACE/MLOG_COMP_REC_UPDATE_IN_PLACE or we have to add
new redo log record. For now, just write dummy sys fields to the redo
log if we are updating a secondary index record. */
mach_write_to_1(log_ptr, flags);
log_ptr++;
if (dict_index_is_clust(index)) {
log_ptr = row_upd_write_sys_vals_to_log(
index, trx_id, roll_ptr, log_ptr, mtr);
} else {
/* Dummy system fields for a secondary index */
/* TRX_ID Position */
log_ptr += mach_write_compressed(log_ptr, 0);
/* ROLL_PTR */
trx_write_roll_ptr(log_ptr, 0);
log_ptr += DATA_ROLL_PTR_LEN;
/* TRX_ID */
log_ptr += mach_u64_write_compressed(log_ptr, 0);
}
mach_write_to_2(log_ptr, page_offset(rec));
log_ptr += 2;
row_upd_index_write_log(update, log_ptr, mtr);
}

Analysis:

  1. Get the start address of the record's page:

const page_t* page = page_align(rec);

The core of it is:

/* ptr: the in-memory address of rec; align_no: the page alignment size,
   here 0x4000, i.e. the 16KB page size */
((void*)((((ulint) ptr)) & ~(align_no - 1)));

which masks the address down to the start of its page.
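To make the masking concrete, here is a small standalone sketch (not InnoDB code; kPageSize, page_align_addr, and page_offset_addr are invented names) showing how clearing the low bits of a record's address yields its page start, and keeping them yields its offset within the page:

```cpp
#include <cstdint>

// Assumption: 16 KB pages (0x4000 bytes), as in the article's setup.
static const uintptr_t kPageSize = 0x4000;

// Clear the low 14 bits: the start of the page containing ptr.
inline uintptr_t page_align_addr(uintptr_t ptr) {
    return ptr & ~(kPageSize - 1);
}

// Keep the low 14 bits: the record's offset inside its page.
inline uintptr_t page_offset_addr(uintptr_t ptr) {
    return ptr & (kPageSize - 1);
}
```

The same masks underlie both page_align (used here) and page_offset (used later when the record's in-page offset is logged).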

  2. Redo log initialization:
log_ptr = mlog_open_and_write_index(mtr, rec, index, page_is_comp(page)
? MLOG_COMP_REC_UPDATE_IN_PLACE
: MLOG_REC_UPDATE_IN_PLACE,
1 + DATA_ROLL_PTR_LEN + 14 + 2
+ MLOG_BUF_MARGIN);

First, look at the definition of this function:

/********************************************************//**
Opens a buffer for mlog, writes the initial log record and,
if needed, the field lengths of an index.
@return buffer, NULL if log mode MTR_LOG_NONE */

byte*
mlog_open_and_write_index(
/*======================*/
mtr_t* mtr, /*!< in: mtr */
const byte* rec, /*!< in: index record or page */
const dict_index_t* index, /*!< in: record descriptor */
mlog_id_t type, /*!< in: log item type */
ulint size) /*!< in: requested buffer size in bytes
(if 0, calls mlog_close() and
returns NULL) */
{
byte* log_ptr;
const byte* log_start;
const byte* log_end;
ut_ad(!!page_rec_is_comp(rec) == dict_table_is_comp(index->table));
if (!page_rec_is_comp(rec)) {
log_start = log_ptr = mlog_open(mtr, 11 + size);
if (!log_ptr) {
return(NULL); /* logging is disabled */
}
log_ptr = mlog_write_initial_log_record_fast(rec, type,
log_ptr, mtr);
log_end = log_ptr + 11 + size;
} else {
ulint i;
ulint n = dict_index_get_n_fields(index);
ulint total = 11 + size + (n + 2) * 2;
ulint alloc = total;
if (alloc > mtr_buf_t::MAX_DATA_SIZE) {
alloc = mtr_buf_t::MAX_DATA_SIZE;
}
/* For spatial index, on non-leaf page, we just keep
2 fields, MBR and page no. */
if (dict_index_is_spatial(index)
&& !page_is_leaf(page_align(rec))) {
n = DICT_INDEX_SPATIAL_NODEPTR_SIZE;
}
log_start = log_ptr = mlog_open(mtr, alloc);
if (!log_ptr) {
return(NULL); /* logging is disabled */
}
log_end = log_ptr + alloc;
log_ptr = mlog_write_initial_log_record_fast(
rec, type, log_ptr, mtr);
mach_write_to_2(log_ptr, n);
log_ptr += 2;
if (page_is_leaf(page_align(rec))) {
mach_write_to_2(
log_ptr, dict_index_get_n_unique_in_tree(index));
} else {
mach_write_to_2(
log_ptr,
dict_index_get_n_unique_in_tree_nonleaf(index));
}
log_ptr += 2;
for (i = 0; i < n; i++) {
dict_field_t* field;
const dict_col_t* col;
ulint len;
field = dict_index_get_nth_field(index, i);
col = dict_field_get_col(field);
len = field->fixed_len;
ut_ad(len < 0x7fff);
if (len == 0
&& (DATA_BIG_COL(col))) {
/* variable-length field with maximum length > 255 */
len = 0x7fff;
}
if (col->prtype & DATA_NOT_NULL) {
len |= 0x8000;
}
if (log_ptr + 2 > log_end) {
mlog_close(mtr, log_ptr);
ut_a(total > (ulint) (log_ptr - log_start));
total -= log_ptr - log_start;
alloc = total;
if (alloc > mtr_buf_t::MAX_DATA_SIZE) {
alloc = mtr_buf_t::MAX_DATA_SIZE;
}
log_start = log_ptr = mlog_open(mtr, alloc);
if (!log_ptr) {
return(NULL); /* logging is disabled */
}
log_end = log_ptr + alloc;
}
mach_write_to_2(log_ptr, len);
log_ptr += 2;
}
}
if (size == 0) {
mlog_close(mtr, log_ptr);
log_ptr = NULL;
} else if (log_ptr + size > log_end) {
mlog_close(mtr, log_ptr);
log_ptr = mlog_open(mtr, size);
}
return(log_ptr);
}

Here the compact and redundant record formats are handled separately:

if (!page_rec_is_comp(rec)) {
// redundant record type redo log index
} else {
// compact record type redo log index
}

Since our record format is compact, we only follow the second branch.

Obtaining a redo buffer from the mtr's log buffer

Focusing on the compact record type, the first thing to see is a buffer block being obtained from the mtr's log buffer:

ulint i;
ulint n = dict_index_get_n_fields(index);
ulint total = 11 + size + (n + 2) * 2;
ulint alloc = total;
if (alloc > mtr_buf_t::MAX_DATA_SIZE) {
alloc = mtr_buf_t::MAX_DATA_SIZE;
}
/* For spatial index, on non-leaf page, we just keep
2 fields, MBR and page no. */
if (dict_index_is_spatial(index)
&& !page_is_leaf(page_align(rec))) {
n = DICT_INDEX_SPATIAL_NODEPTR_SIZE;
}
log_start = log_ptr = mlog_open(mtr, alloc);
if (!log_ptr) {
return(NULL); /* logging is disabled */
}
log_end = log_ptr + alloc;

Analysis:

  1. total is made up of several parts:

     • what are the 11 bytes for?
     • what does the size allocation mean?
     • (n + 2) * 2, where n is the number of columns in rec — what is this for?

       size = 1 + DATA_ROLL_PTR_LEN + 14 + 2 + MLOG_BUF_MARGIN;

  2. A buffer block is obtained from the mtr's log buffer, which is organized as a list of block buffers; see the implementation of the mtr's log type for details.

    mlog_write_initial_log_record_fast
    Next we look at what log_ptr = mlog_write_initial_log_record_fast(rec, type, log_ptr, mtr); actually writes.

/********************************************************//**
Writes the initial part of a log record (3..11 bytes).
If the implementation of this function is changed, all
size parameters to mlog_open() should be adjusted accordingly!
@return new value of log_ptr */
UNIV_INLINE
byte*
mlog_write_initial_log_record_fast(
/*===============================*/
const byte* ptr, /*!< in: pointer to (inside) a buffer
frame holding the file page where
modification is made */
mlog_id_t type, /*!< in: log item type: MLOG_1BYTE, ... */
byte* log_ptr,/*!< in: pointer to mtr log which has
been opened */
mtr_t* mtr) /*!< in/out: mtr */
{
const byte* page;
ulint space;
ulint offset;
ut_ad(log_ptr);
ut_d(mtr->memo_modify_page(ptr));
page = (const byte*) ut_align_down(ptr, UNIV_PAGE_SIZE);
space = mach_read_from_4(page + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID);
offset = mach_read_from_4(page + FIL_PAGE_OFFSET);
/* check whether the page is in the doublewrite buffer;
the doublewrite buffer is located in pages
FSP_EXTENT_SIZE, ..., 3 * FSP_EXTENT_SIZE - 1 in the
system tablespace */
if (space == TRX_SYS_SPACE
&& offset >= FSP_EXTENT_SIZE && offset < 3 * FSP_EXTENT_SIZE) {
if (buf_dblwr_being_created) {
/* Do nothing: we only come to this branch in an InnoDB database creation. We do not redo log anything for the doublewrite buffer pages. */

return(log_ptr);
} else {
ib::error() << "Trying to redo log a record of type "
<< type << " on page "
<< page_id_t(space, offset) << " in the"
" doublewrite buffer, continuing anyway."
" Please post a bug report to"
" bugs.mysql.com.";
ut_ad(0);
}
}
return(mlog_write_initial_log_record_low(type, space, offset,
					 log_ptr, mtr));
}

A question arises here:

/* Do nothing: we only come to this branch in an InnoDB database
creation. We do not redo log anything for the doublewrite buffer
pages. */

It says the doublewrite buffer pages need no redo log — why not? Next, look at:

/** Writes a log record about an operation.
@param[in] type redo log record type */

//@param[in] space_id tablespace identifier
//@param[in] page_no page number
//@param[in,out] log_ptr current end of mini-transaction log
//@param[in,out] mtr mini-transaction
//@return end of mini-transaction log

UNIV_INLINE
byte*
mlog_write_initial_log_record_low(
mlog_id_t type,
ulint space_id,
ulint page_no,
byte* log_ptr,
mtr_t* mtr)
{
ut_ad(type <= MLOG_BIGGEST_TYPE);
ut_ad(type == MLOG_FILE_NAME
|| type == MLOG_FILE_DELETE
|| type == MLOG_FILE_CREATE2
|| type == MLOG_FILE_RENAME2
|| type == MLOG_INDEX_LOAD
|| type == MLOG_TRUNCATE
|| mtr->is_named_space(space_id));
mach_write_to_1(log_ptr, type);
log_ptr++;
log_ptr += mach_write_compressed(log_ptr, space_id);
log_ptr += mach_write_compressed(log_ptr, page_no);
mtr->added_rec();
return(log_ptr);
}

This writes the record type, then the space id and page no.
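As a rough sketch of the header this produces (the helper names are invented; the size thresholds mirror the mach_write_compressed encoding shown later), the initial part of a record is 1 type byte plus two compressed integers, i.e. 3 to 11 bytes — matching the "3..11 bytes" comment seen earlier:

```cpp
#include <cstddef>

// How many bytes the compressed encoding of n occupies (1..5),
// following the thresholds used by mach_write_compressed.
inline size_t compressed_size(unsigned long n) {
    if (n < 0x80)       return 1;   // 7 bits
    if (n < 0x4000)     return 2;   // 14 bits
    if (n < 0x200000)   return 3;   // 21 bits
    if (n < 0x10000000) return 4;   // 28 bits
    return 5;                       // full 32 bits
}

// Initial log record header: 1 type byte + compressed space id
// + compressed page no. (An invented helper, not InnoDB source.)
inline size_t initial_log_record_size(unsigned long space_id,
                                      unsigned long page_no) {
    return 1 + compressed_size(space_id) + compressed_size(page_no);
}
```

For small space ids and page numbers the header is only 3 bytes; in the worst case it grows to 11.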

Two questions remain about the opening assertions:

  • the first says the log type cannot exceed its upper bound, MLOG_BIGGEST_TYPE;
  • the assertion below enumerates the various record-type cases:
ut_ad(
/** note the first use of a tablespace file since checkpoint */
type == MLOG_FILE_NAME
/** delete a tablespace file that starts with (space_id,page_no) */
|| type == MLOG_FILE_DELETE
/** log record about creating an .ibd file, with format */
|| type == MLOG_FILE_CREATE2
/** rename a tablespace file that starts with (space_id,page_no) */
|| type == MLOG_FILE_RENAME2
/** notify that an index tree is being loaded without writing
redo log about individual pages */
|| type == MLOG_INDEX_LOAD
/** Table is being truncated. (Marked only for file-per-table) */
|| type == MLOG_TRUNCATE
/* Check if a tablespace is associated with the mini-transaction
(needed for generating a MLOG_FILE_NAME record) */
|| mtr->is_named_space(space_id));

Below is the length-compression algorithm:

/*********************************************************//**
Writes a ulint in a compressed form where the first byte codes the
length of the stored ulint. We look at the most significant bits of
the byte. If the most significant bit is zero, it means 1-byte storage,
else if the 2nd bit is 0, it means 2-byte storage, else if 3rd is 0,
it means 3-byte storage, else if 4th is 0, it means 4-byte storage,
else the storage is 5-byte.
@return compressed size in bytes */

UNIV_INLINE
ulint
mach_write_compressed(
/*==================*/
byte* b, /*!< in: pointer to memory where to store */
ulint n) /*!< in: ulint integer (< 2^32) to be stored */
{
ut_ad(b);
if (n < 0x80) {
/* 0nnnnnnn(7 bits) */
mach_write_to_1(b, n);
return(1);
} else if (n < 0x4000) {
/* 10nnnnnn nnnnnnnn (14 bits) */
mach_write_to_2(b, n | 0x8000);
return(2);
} else if (n < 0x200000) {
/* 110nnnnn nnnnnnnn nnnnnnnn (21 bits) */
mach_write_to_3(b, n | 0xC00000);
return(3);
} else if (n < 0x10000000) {
/* 1110nnnn nnnnnnnn nnnnnnnn nnnnnnnn (28 bits) */
mach_write_to_4(b, n | 0xE0000000);
return(4);
} else {
/* 11110000 nnnnnnnn nnnnnnnn nnnnnnnn nnnnnnnn (32 bits) */
mach_write_to_1(b, 0xF0);
mach_write_to_4(b + 1, n);
return(5);
}
}
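The scheme can be reproduced and round-tripped in a few lines. This is an educational re-implementation, not InnoDB source (the names are mine); by construction it emits the same big-endian byte patterns as mach_write_compressed:

```cpp
#include <cstdint>
#include <cstddef>

// Encode n into 1..5 bytes; the count of leading 1-bits in the
// first byte selects the storage length.
inline size_t write_compressed(uint8_t* b, uint32_t n) {
    if (n < 0x80) {                        // 0nnnnnnn
        b[0] = static_cast<uint8_t>(n);
        return 1;
    } else if (n < 0x4000) {               // 10nnnnnn nnnnnnnn
        b[0] = static_cast<uint8_t>(0x80 | (n >> 8));
        b[1] = static_cast<uint8_t>(n);
        return 2;
    } else if (n < 0x200000) {             // 110nnnnn + 2 bytes
        b[0] = static_cast<uint8_t>(0xC0 | (n >> 16));
        b[1] = static_cast<uint8_t>(n >> 8);
        b[2] = static_cast<uint8_t>(n);
        return 3;
    } else if (n < 0x10000000) {           // 1110nnnn + 3 bytes
        b[0] = static_cast<uint8_t>(0xE0 | (n >> 24));
        b[1] = static_cast<uint8_t>(n >> 16);
        b[2] = static_cast<uint8_t>(n >> 8);
        b[3] = static_cast<uint8_t>(n);
        return 4;
    }
    b[0] = 0xF0;                           // 11110000 + full 4 bytes
    b[1] = static_cast<uint8_t>(n >> 24);
    b[2] = static_cast<uint8_t>(n >> 16);
    b[3] = static_cast<uint8_t>(n >> 8);
    b[4] = static_cast<uint8_t>(n);
    return 5;
}

// Decode: inspect the first byte to learn the storage length.
inline uint32_t read_compressed(const uint8_t* b, size_t* len) {
    if (b[0] < 0x80) { *len = 1; return b[0]; }
    if (b[0] < 0xC0) { *len = 2; return ((b[0] & 0x3F) << 8) | b[1]; }
    if (b[0] < 0xE0) {
        *len = 3;
        return ((b[0] & 0x1F) << 16) | (b[1] << 8) | b[2];
    }
    if (b[0] < 0xF0) {
        *len = 4;
        return (uint32_t(b[0] & 0x0F) << 24) | (b[1] << 16)
             | (b[2] << 8) | b[3];
    }
    *len = 5;
    return (uint32_t(b[1]) << 24) | (b[2] << 16) | (b[3] << 8) | b[4];
}
```

Small, frequent values (space ids, page numbers, field counts) thus usually cost one or two bytes instead of four.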

Writing the record count and the number of uniqueness-determining columns
Next:

mach_write_to_2(log_ptr, n);
log_ptr += 2;
if (page_is_leaf(page_align(rec))) {
mach_write_to_2(
log_ptr, dict_index_get_n_unique_in_tree(index));
} else {
mach_write_to_2(
log_ptr,
dict_index_get_n_unique_in_tree_nonleaf(index));
}
log_ptr += 2;

Two steps:

  1. Write the number of fields in the record.
  2. Write index->n_uniq (/*!< number of fields from the beginning which are enough to determine an index entry uniquely */). At first it was unclear what this field is for and why it is written; looking it up shows that n_uniq is the number of columns that determine a record's uniqueness.

This writes, in two bytes, the number of columns on the row that determine uniqueness (dict_index_get_n_unique_in_tree): for a clustered index, the number of PK columns; for a secondary index, the secondary-index columns plus the PK columns.

Writing the length of each column in the redo record
The following code:

for (i = 0; i < n; i++) {
dict_field_t* field;
const dict_col_t* col;
ulint len;
field = dict_index_get_nth_field(index, i);
col = dict_field_get_col(field);
len = field->fixed_len;
ut_ad(len < 0x7fff);
if (len == 0
&& (DATA_BIG_COL(col))) {
/* variable-length field with maximum length > 255 */
len = 0x7fff;
}
if (col->prtype & DATA_NOT_NULL) {
len |= 0x8000;
}
if (log_ptr + 2 > log_end) {
mlog_close(mtr, log_ptr);
ut_a(total > (ulint) (log_ptr - log_start));
total -= log_ptr - log_start;
alloc = total;
if (alloc > mtr_buf_t::MAX_DATA_SIZE) {
alloc = mtr_buf_t::MAX_DATA_SIZE;
}
log_start = log_ptr = mlog_open(mtr, alloc);
if (!log_ptr) {
return(NULL); /* logging is disabled */
}
log_end = log_ptr + alloc;
}
mach_write_to_2(log_ptr, len);
log_ptr += 2;
}

This mainly writes the length of every column in the record. Worth noting:

  • for big columns such as TEXT and BLOB: len = 0x7fff;
  • col->prtype here determines a column's precise type:

    the data type: varchar, char, int, and so on;
    the charset code;
    whether it may be NULL;
    whether it is signed;
    whether it is a binary string;
    whether it is a varchar stored with a two-byte length.

In addition, DATA_NOT_NULL here is the mask indicating whether the type is NOT NULL.
Finally, if the record has so many columns that one block buffer cannot hold them all, a new block is allocated.
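The length-word encoding can be captured in a tiny sketch (encode_field_len is an invented name; the 0x7fff and 0x8000 constants are the ones from the code above):

```cpp
#include <cstdint>

// Per-field length word as written into the index header:
// 0x7fff marks a variable-length column whose maximum length
// exceeds 255 bytes, and the top bit 0x8000 carries the
// DATA_NOT_NULL attribute. (Illustrative helper, not InnoDB source.)
inline uint16_t encode_field_len(uint16_t fixed_len,
                                 bool big_variable_col,
                                 bool not_null) {
    uint16_t len = fixed_len;
    if (len == 0 && big_variable_col) {
        len = 0x7fff;           // variable-length, max length > 255
    }
    if (not_null) {
        len |= 0x8000;          // DATA_NOT_NULL flag bit
    }
    return len;
}
```

The ut_ad(len < 0x7fff) assertion in the original code guarantees the flag bits never collide with a real fixed length.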
The final checks
Last comes the length check on log_ptr; note that size here is the requested buffer size:

  • if size is 0, then log_ptr = NULL;
  • if log_ptr + size > log_end, the current buffer cannot provide size bytes of memory, so a new block buffer is allocated.

    Writing the redo log body
    Writing the flags byte
    Next we can see:
/* For secondary indexes, we could skip writing the dummy system
fields to the redo log but we have to change redo log parsing of
MLOG_REC_UPDATE_IN_PLACE/MLOG_COMP_REC_UPDATE_IN_PLACE or we have to add
new redo log record. For now, just write dummy sys fields to the redo
log if we are updating a secondary index record. */
mach_write_to_1(log_ptr, flags);
log_ptr++;

This flags value describes the kind of operation on the btr_cur, but its concrete role is also unknown here.
Writing the record's TRX_ID and ROLL_PTR fields
Code:

/*********************************************************************//**
Writes into the redo log the values of trx id and roll ptr and enough info
to determine their positions within a clustered index record.
@return new pointer to mlog */

byte*
row_upd_write_sys_vals_to_log(
/*==========================*/
dict_index_t* index, /*!< in: clustered index */
trx_id_t trx_id, /*!< in: transaction id */
roll_ptr_t roll_ptr,/*!< in: roll ptr of the undo log record */
byte* log_ptr,/*!< pointer to a buffer of size > 20 opened
in mlog */
mtr_t* mtr MY_ATTRIBUTE((unused))) /*!< in: mtr */
{
ut_ad(dict_index_is_clust(index));
ut_ad(mtr);
log_ptr += mach_write_compressed(log_ptr,
dict_index_get_sys_col_pos(
index, DATA_TRX_ID));
trx_write_roll_ptr(log_ptr, roll_ptr);
log_ptr += DATA_ROLL_PTR_LEN;
log_ptr += mach_u64_write_compressed(log_ptr, trx_id);
return(log_ptr);
}

Three steps:

  1. Write, in compressed form, the position of DATA_TRX_ID within the physical record (counting from 0).
  2. Write the roll pointer.
  3. Write the transaction ID in compressed form.

Writing the record's offset within the page

mach_write_to_2(log_ptr, page_offset(rec));
log_ptr += 2;

This writes the record's offset relative to the start of its page.

Writing the updated columns' info bits and the total number of updated columns
Code:

n_fields = upd_get_n_fields(update);
buf_end = log_ptr + MLOG_BUF_MARGIN;
mach_write_to_1(log_ptr, update->info_bits);
log_ptr++;
log_ptr += mach_write_compressed(log_ptr, n_fields);

Writing each updated column's position within the record, its size, and its new value

upd_field = upd_get_nth_field(update, i);
new_val = &(upd_field->new_val);
len = dfield_get_len(new_val);
/* If this is a virtual column, mark it using special
field_no */
ulint field_no = upd_fld_is_virtual_col(upd_field)
? REC_MAX_N_FIELDS + upd_field->field_no
: upd_field->field_no;
log_ptr += mach_write_compressed(log_ptr, field_no);
log_ptr += mach_write_compressed(log_ptr, len);

Next the new values themselves are written; if the record is too large, several block buffers are needed to hold it.

Summary
At this point the data in the mtr's redo log buffer is (0x29 bytes in total):

[figure: byte layout of the mtr redo log buffer]

Summarized in the table below:

[figure: field-by-field summary of the redo record]

Committing the mtr: writing the mtr redo log buffer into the redo log_sys buffer

The mtr commit process

After the redo logging above finishes, control returns from btr_cur_optimistic_update in row_upd_clust_rec and moves on to mtr_commit to commit the mini-transaction. First, the code of mtr_commit:

/** Commit a mini-transaction. */
void
mtr_t::commit()
{
ut_ad(is_active());
ut_ad(!is_inside_ibuf());
ut_ad(m_impl.m_magic_n == MTR_MAGIC_N);
m_impl.m_state = MTR_STATE_COMMITTING;
/* This is a dirty read, for debugging. */
ut_ad(!recv_no_log_write);
Command cmd(this);
if (m_impl.m_modifications
&& (m_impl.m_n_log_recs > 0
|| m_impl.m_log_mode == MTR_LOG_NO_REDO)) {
ut_ad(!srv_read_only_mode
|| m_impl.m_log_mode == MTR_LOG_NO_REDO);
cmd.execute();
} else {
cmd.release_all();
cmd.release_resources();
}
}

The main things to note:

  • m_impl.m_state: the state of the mini-transaction, one of:

enum mtr_state_t {
	MTR_STATE_INIT = 0,
	MTR_STATE_ACTIVE = 12231,
	MTR_STATE_COMMITTING = 56456,
	MTR_STATE_COMMITTED = 34676
};

  • Class Command is an inner class of mtr_t. What is its purpose, why design it this way, and on what principle? Another question worth raising: why is m_impl designed as a structure inside mtr_t?
  • cmd.execute() is the main commit function and is analyzed in detail below. First, its code:
/** Write the redo log record, add dirty pages to the flush list and release
the resources. */
void
mtr_t::Command::execute()
{
ut_ad(m_impl->m_log_mode != MTR_LOG_NONE);
if (const ulint len = prepare_write()) {
finish_write(len);
}
if (m_impl->m_made_dirty) {
log_flush_order_mutex_enter();
}
/* It is now safe to release the log mutex because the
flush_order mutex will ensure that we are the first one
to insert into the flush list. */
log_mutex_exit();
m_impl->m_mtr->m_commit_lsn = m_end_lsn;
release_blocks();
if (m_impl->m_made_dirty) {
log_flush_order_mutex_exit();
}
release_latches();
release_resources();
}

Preparing the commit: mtr_t::Command::prepare_write()

1. Code 1: how the log is written under the various log modes

switch (m_impl->m_log_mode) {
case MTR_LOG_SHORT_INSERTS:
ut_ad(0);
/* fall through (write no redo log) */
case MTR_LOG_NO_REDO:
case MTR_LOG_NONE:
ut_ad(m_impl->m_log.size() == 0);
log_mutex_enter();
m_end_lsn = m_start_lsn = log_sys->lsn;
return(0);
case MTR_LOG_ALL:
	break;
}

2. Code 2: when the redo log_sys buffer has insufficient space

ulint len = m_impl->m_log.size();
ulint n_recs = m_impl->m_n_log_recs;
ut_ad(len > 0);
ut_ad(n_recs > 0);
if (len > log_sys->buf_size / 2) {
log_buffer_extend((len + 1) * 2);
}
ut_ad(m_impl->m_n_log_recs == n_recs);

Is log_sys->buf_size the size of the available redo log buffer? For now we take it to be exactly that. When the redo log a mini-transaction commits is larger than half of the current log_sys buffer, the log_sys buffer is extended to (len + 1) * 2. The next question is what log_buffer_extend does and where the memory it requests comes from.

Since this branch cannot be reached in our run, the details are not entirely clear, but roughly: it first flushes to disk, and may then reallocate the redo log_sys buffer.
3. Code 3: when the space id belongs to the system tablespace or an undo tablespace

fil_space_t* space = m_impl->m_user_space;
if (space != NULL && is_system_or_undo_tablespace(space->id)) {
/* Omit MLOG_FILE_NAME for predefined tablespaces. */
space = NULL;
}

This checks whether the current space->id belongs to an undo segment or the system tablespace. Debugging shows:

  • the system space id is 0;
  • no undo tablespace seems to be allocated at this point, although undo log has already been written; the code shows srv_undo_space_id_start as 0 — does that mean no tablespace is allocated for the undo log?

4. Code 4: check whether this mtr is the first to modify its tablespace; if so, write an MLOG_FILE_NAME record

if (fil_names_write_if_was_clean(space, m_impl->m_mtr)) {
/* This mini-transaction was the first one to modify
this tablespace since the latest checkpoint, so some
MLOG_FILE_NAME records were appended to m_log. */
ut_ad(m_impl->m_n_log_recs > n_recs);
mlog_catenate_ulint(
&m_impl->m_log, MLOG_MULTI_REC_END, MLOG_1BYTE);
len = m_impl->m_log.size();
}

Two steps:

  1. the fil_names_write_if_was_clean check;
  2. writing the mtr log end marker.

fil_names_write_if_was_clean: checking whether this space has been modified for the first time since the last checkpoint
This function sits atop a fairly deep call chain:
fil_op_write_log
fil_name_write
fil_name_write
fil_names_write
fil_names_dirty_and_write
fil_names_write_if_was_clean

This call chain does the following:

  1. Detect whether this is the space's first modification since it was flushed (func: fil_names_write_if_was_clean):
 const bool was_clean = space->max_lsn == 0;
ut_ad(space->max_lsn <= log_sys->lsn);
space->max_lsn = log_sys->lsn;
if (was_clean) {
fil_names_dirty_and_write(space, mtr);
}

From this code we learn: every time dirty pages are flushed to disk, the tablespace's max LSN is reset to 0; the code above determines whether a tablespace is being modified for the first time by checking whether its maximum LSN is 0.

  2. Add a descriptor of the current space to the fil_system->named_spaces list (func: fil_names_dirty_and_write).
  3. Analyze what fil_op_write_log writes into the mtr redo log buffer. Most of it has come up before, so we do not repeat it here; note, however, that the function performs many extra checks when the redo type is MLOG_FILE_RENAME2, which deserves attention later. The data it writes:

[figure: layout of the MLOG_FILE_NAME record]

    The function's comment states its purpose:

/* This mini-transaction was the first one to modify
this tablespace since the latest checkpoint, so
some MLOG_FILE_NAME records were appended to m_log. */

This makes it convenient during recovery to read in the needed tablespace files and reduce the redo work.
The mtr redo end marker when the tablespace is modified for the first time
Code:

/* This mini-transaction was the first one to modify
this tablespace since the latest checkpoint, so
some MLOG_FILE_NAME records were appended to m_log. */
ut_ad(m_impl->m_n_log_recs > n_recs);
mlog_catenate_ulint(
&m_impl->m_log, MLOG_MULTI_REC_END, MLOG_1BYTE);
len = m_impl->m_log.size();

We can see it writes a record of type MLOG_MULTI_REC_END into the mtr redo log buffer.

[figure: mtr redo log buffer with the MLOG_MULTI_REC_END marker]

5. Code 5: the mtr redo log end marker when this is not the first modification of the tablespace

else {
/* This was not the first time of dirtying a
tablespace since the latest checkpoint. */
ut_ad(n_recs == m_impl->m_n_log_recs);
if (n_recs <= 1) {
ut_ad(n_recs == 1);
/* Flag the single log record as the
only record in this mini-transaction. */
*m_impl->m_log.front()->begin()
|= MLOG_SINGLE_REC_FLAG;
} else {
	/* Because this mini-transaction comprises
	multiple log records, append MLOG_MULTI_REC_END
	at the end. */
mlog_catenate_ulint(
&m_impl->m_log, MLOG_MULTI_REC_END,
MLOG_1BYTE);
len++;
}
}

The logic above is simple:

  • if the mtr holds a single redo log record (curiously, why does the redo log written for the undo log not count here?), the log type written earlier is changed to type | MLOG_SINGLE_REC_FLAG;
  • if the mtr holds multiple redo log records, MLOG_MULTI_REC_END is written at the end.
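A minimal model of this closing step might look as follows (the constants are taken from MySQL 5.7's mtr0types.h as I understand them; the vector-based buffer is a simplification of mtr_buf_t, and the type byte 13 in the usage is illustrative):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Constants as in MySQL 5.7's mtr0types.h (assumption).
static const uint8_t MLOG_SINGLE_REC_FLAG = 128;
static const uint8_t MLOG_MULTI_REC_END   = 31;

// Close an mtr's log: a single-record mtr sets the flag bit on the
// type byte of its only record; a multi-record mtr appends an
// MLOG_MULTI_REC_END marker byte instead.
inline void close_mtr_log(std::vector<uint8_t>& log, size_t n_recs) {
    if (n_recs <= 1) {
        log.front() |= MLOG_SINGLE_REC_FLAG;  // flag the only record
    } else {
        log.push_back(MLOG_MULTI_REC_END);    // terminate the group
    }
}
```

During recovery this lets the parser decide whether a record group is complete: either the first type byte carries the single-record flag, or parsing continues until the end marker.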

    Checking whether the redo log_sys buffer has enough space: log_margin_checkpoint_age

    First, ulint margin = log_calculate_actual_len(len); computes the length of the redo produced by the mtr, adding the header and trailer lengths of each redo log block.
    1. Code 1: check whether the mtr's redo log length exceeds the redo log group size, 96M.
    Here log_sys->log_group_capacity is the total size of the log group's two files, ib_logfile0 and ib_logfile1, each 48M.
if (margin > log_sys->log_group_capacity) {

/* return with warning output to avoid deadlock */

if (!log_has_printed_chkp_margine_warning
|| difftime(time(NULL),
log_last_margine_warning_time) > 15) {
log_has_printed_chkp_margine_warning = true;
log_last_margine_warning_time = time(NULL);
ib::error() << "The transaction log files are too"
" small for the single transaction log (size="
<< len << "). So, the last checkpoint age"
" might exceed the log group capacity "
<< log_sys->log_group_capacity << ".";
}
return;
}

2. Code 2: when the mtr's redo log length + LSN - checkpoint LSN exceeds the total redo log group size

/* Our margin check should ensure that we never reach this condition.
Try to do checkpoint once. We cannot keep waiting here as it might
result in hang in case the current mtr has latch on oldest lsn */
if (log_sys->lsn - log_sys->last_checkpoint_lsn + margin
> log_sys->log_group_capacity) {
/* The log write of 'len' might overwrite the transaction log
after the last checkpoint. Makes checkpoint. */
bool flushed_enough = false;
if (log_sys->lsn - log_buf_pool_get_oldest_modification()
+ margin
<= log_sys->log_group_capacity) {
flushed_enough = true;
}
log_sys->check_flush_or_checkpoint = true;
log_mutex_exit();
DEBUG_SYNC_C("margin_checkpoint_age_rescue");
if (!flushed_enough) {
os_thread_sleep(100000);
}
log_checkpoint(true, false);
log_mutex_enter();
}

First, log_sys->lsn - log_sys->last_checkpoint_lsn + margin > log_sys->log_group_capacity means that the redo produced since the last checkpoint exceeds the redo log group size; persisting it could then overwrite redo that is still needed, so the oldest modified pages in the buffer pool must be flushed first. Hence this check:

if (log_sys->lsn - log_buf_pool_get_oldest_modification()
    + margin <= log_sys->log_group_capacity) {
	flushed_enough = true;
}

This checks that log_sys's LSN minus the oldest-modification LSN in the buffer pool, plus margin, stays within the redo log group size; in that case buffer pool pages need not be flushed to disk, reducing the cost. Part of the redo log buffer still needs to be flushed; see the log_checkpoint function.
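The age arithmetic can be modeled in isolation (a simplified sketch with invented names and illustrative numbers, not InnoDB code):

```cpp
#include <cstdint>

// Simplified model of the age checks in log_margin_checkpoint_age.
struct LogState {
    uint64_t lsn;                  // newest LSN in the log_sys buffer
    uint64_t last_checkpoint_lsn;  // LSN of the latest checkpoint
    uint64_t oldest_modification;  // oldest dirty-page LSN in buffer pool
    uint64_t group_capacity;       // total size of the redo log group
};

// Redo since the last checkpoint plus the incoming mtr would
// exceed the group capacity: a checkpoint is required.
inline bool checkpoint_needed(const LogState& s, uint64_t margin) {
    return s.lsn - s.last_checkpoint_lsn + margin > s.group_capacity;
}

// Advancing the checkpoint to the oldest dirty-page LSN already
// frees enough space, so no buffer pool flush is needed first.
inline bool flushed_enough(const LogState& s, uint64_t margin) {
    return s.lsn - s.oldest_modification + margin <= s.group_capacity;
}
```

When checkpoint_needed holds but flushed_enough does not, the dirty pages themselves have to be flushed before the checkpoint can advance.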
3. Code 3: redo log checkpoint in brief
Since this branch cannot be reached in our run, here is an outline of the log_checkpoint steps above:

  • fil_names_clear:
  1. If a transaction is currently committing, append a redo record (though it is not clear why this is done).
  2. Write the mtr's redo log buffer into the redo log_sys buffer; key function: log_write_low.
  3. Open an mtr and write an MLOG_FILE_NAME record into its redo log buffer for every tablespace that is open and needs flushing; if too much log is submitted at once, it may be split across several commits. Each commit ends with an mtr redo end marker (MLOG_MULTI_REC_END or MLOG_SINGLE_REC_FLAG), and additionally writes one MLOG_CHECKPOINT plus the 8-byte checkpoint LSN.
  4. Afterwards, two statements write the log to the log buffer and release resources.
  • Flush the log to disk: log_write_up_to, the main redo flush routine, analyzed later.
  • Finally, write this checkpoint's information into the log group header: log_write_checkpoint_info(sync); records checkpoint_lsn, checkpoint_no, and so on.

    With that, prepare_write is complete. Next, look at mtr_t::Command::finish_write.

    This is the main function that writes the redo log records in the mtr redo log buffer into the redo log_sys buffer:
/** Append the redo log records to the redo log buffer
@param[in] len number of bytes to write */
void
mtr_t::Command::finish_write(
ulint len)
{
ut_ad(m_impl->m_log_mode == MTR_LOG_ALL);
ut_ad(log_mutex_own());
ut_ad(m_impl->m_log.size() == len);
ut_ad(len > 0);
if (m_impl->m_log.is_small()) {
const mtr_buf_t::block_t* front = m_impl->m_log.front();
ut_ad(len <= front->used());
m_end_lsn = log_reserve_and_write_fast(
front->begin(), len, &m_start_lsn);
if (m_end_lsn > 0) {
return;
}
}
/* Open the database log for log_write_low */
m_start_lsn = log_reserve_and_open(len);
mtr_write_log_t write_log;
m_impl->m_log.for_each_block(write_log);
m_end_lsn = log_close();
}

There are two branches:

  1. the mtr's redo log buffer has not filled a single block (508 bytes of payload);
  2. the mtr's redo log buffer has spilled into multiple blocks.

For now we only cover the first case, where a single block's worth is written into the redo log_sys buffer.

  1. Only the first block of the mtr redo log buffer contains redo log records
    Code:
if (m_impl->m_log.is_small()) {
	const mtr_buf_t::block_t* front = m_impl->m_log.front();
	ut_ad(len <= front->used());
	m_end_lsn = log_reserve_and_write_fast(
		front->begin(), len, &m_start_lsn);
	if (m_end_lsn > 0) {
		return;
	}
}

Here log_reserve_and_write_fast writes the redo log records stored in the mtr redo log buffer into the redo log_sys buffer. Below we analyze this function, quoting its key code with the #ifdef UNIV_LOG_LSN_DEBUG /*code*/ #endif /* UNIV_LOG_LSN_DEBUG */ sections removed — from which we can see that an mtr's redo log does not write an LSN into the redo log_sys buffer.

#ifndef UNIV_HOTBACKUP
/** Append a string to the log.
@param[in] str string
@param[in] len string length
@param[out] start_lsn start LSN of the log record
@return end lsn of the log record, zero if did not succeed */

UNIV_INLINE
lsn_t
log_reserve_and_write_fast(
const void* str,
ulint len,
lsn_t* start_lsn)
{
const ulint data_len = len
+ log_sys->buf_free % OS_FILE_LOG_BLOCK_SIZE;
if (data_len >= OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE) {
/* The string does not fit within the current log block or the log block would become full */
return(0);
}
*start_lsn = log_sys->lsn;
memcpy(log_sys->buf + log_sys->buf_free, str, len);
/* update LOG_BLOCK_HDR_DATA_LEN of the log_sys buffer block */
log_block_set_data_len(
reinterpret_cast<byte*>(ut_align_down(
log_sys->buf + log_sys->buf_free,
OS_FILE_LOG_BLOCK_SIZE)),
data_len);
log_sys->buf_free += len;
log_sys->lsn += len;
return(log_sys->lsn);
}
/************************************************************//**
Sets the log block data length. */
UNIV_INLINE
void
log_block_set_data_len(
/*===================*/
byte* log_block, /*!< in/out: log block */
ulint len) /*!< in: data length */
{
mach_write_to_2(log_block + LOG_BLOCK_HDR_DATA_LEN, len);
}
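The fast-path fit test can be isolated as follows (a sketch with invented names; OS_FILE_LOG_BLOCK_SIZE = 512 and LOG_BLOCK_TRL_SIZE = 4 as in InnoDB):

```cpp
#include <cstdint>

// A redo log block is 512 bytes with a 4-byte trailer, so the fast
// path in log_reserve_and_write_fast applies only when the record
// fits inside the block that buf_free currently points into.
static const uint64_t kBlockSize   = 512;  // OS_FILE_LOG_BLOCK_SIZE
static const uint64_t kTrailerSize = 4;    // LOG_BLOCK_TRL_SIZE

inline bool fits_in_current_block(uint64_t buf_free, uint64_t len) {
    // data_len counts from the start of the current block.
    uint64_t data_len = len + buf_free % kBlockSize;
    return data_len < kBlockSize - kTrailerSize;
}
```

When the test fails, log_reserve_and_write_fast returns 0 and finish_write falls back to the slow path (log_reserve_and_open / log_write_low / log_close), which splits the record across blocks.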

The data now in the redo log_sys buffer

Some resource-release steps follow. We can now summarize the data in the buffer:

  1. the redo record of the update in place;
  2. the first-modification record for the tablespace;
  3. the mtr multi-record end marker.

    Flushing the redo log to disk

    First we should identify every situation that triggers a flush of the redo log_sys buffer. The approach: observe which functions call the core redo-flush functions; I chose:
    1. log_checkpoint
    2. log_write_up_to
    Here are the cases I found in the code:

    Case 1: the amount of redo briefly spikes past the log group size

    As the code below notes, this only happens when redo log grows explosively within a short period. First, the code (func: log_margin_checkpoint_age):
// func: log_margin_checkpoint_age
/* Our margin check should ensure that we never reach this condition.
Try to do checkpoint once. We cannot keep waiting here as it might
result in hang in case the current mtr has latch on oldest lsn */
if (log_sys->lsn - log_sys->last_checkpoint_lsn + margin
> log_sys->log_group_capacity) {
/*code*/
log_sys->check_flush_or_checkpoint = true;
log_checkpoint(true, false);
}

Here:

  1. log_sys->lsn: the newest LSN in the current log_sys buffer.
  2. log_sys->last_checkpoint_lsn: the latest checkpoint LSN, i.e. the LSN up to which buffer pool data has been flushed to disk.
  3. margin: the length of the redo log record about to be inserted.
  4. log_sys->log_group_capacity: the capacity of the current redo log group, i.e. the total size of ib_logfile0 and ib_logfile1; on my machine 48M + 48M = 96M.
    Setting log_sys->check_flush_or_checkpoint = true prompts the master thread to run log_free_check(void), which performs the checks related to flushing dirty pages and the log.
    This code therefore demonstrates one case of redo checkpointing:

Conclusion 1: if the redo log produced for the current dirty pages exceeds the capacity of the whole redo log group, the logs in the redo log_sys buffer are flushed to disk and a checkpoint is made.

Case 2: a transaction commit flushes the redo log_sys buffer

Code (func: trx_flush_log_if_needed_low):

switch (srv_flush_log_at_trx_commit) {
case 2:
/* Write the log but do not flush it to disk */
flush = false;
/* fall through */
case 1:
/* Write the log and optionally flush it to disk */
log_write_up_to(lsn, flush);
return;
case 0: /* the redo is flushed to disk once per second elsewhere */
/* Do nothing */
return;
}

Different values of srv_flush_log_at_trx_commit give different flushing behavior:

  • 0: the master thread flushes the redo log_sys buffer once per second;
  • 1: write the log file and flush it to disk on every transaction commit;
  • 2: write the log file but do not flush it on every transaction commit.
    The call stack during commit is:
log_write_up_to(lsn_t lsn, bool flush_to_disk)
trx_flush_log_if_needed_low(lsn_t lsn)
trx_flush_log_if_needed(lsn_t lsn, trx_t * trx)
trx_commit_complete_for_mysql(trx_t * trx)
innobase_commit(handlerton * hton, THD * thd, bool commit_trx)

This commit covers:

  1. ordinary transaction commit: trx_commit_complete_for_mysql;
  2. temporary-table transaction commit: trx_commit_in_memory;
  3. group commit: trx_prepare.
    The flush for srv_flush_log_at_trx_commit == 0 is covered below; the other two cases are implemented in log_write_up_to.
    We can therefore conclude:
    Conclusion 2: when a transaction commits, the logs in the redo log_sys buffer are flushed to disk.

    1. The implementation when srv_flush_log_at_trx_commit == 0
      A value of 0 means the redo log is committed once per second. (func: srv_master_do_active_tasks, srv_master_do_idle_tasks, srv_master_do_shutdown_tasks)
/********************************************************************//**
The master thread is tasked to ensure that flush of log file happens
once every second in the background. This is to ensure that not more
than one second of trxs are lost in case of crash when
innodb_flush_logs_at_trx_commit != 1 */
static
void
srv_sync_log_buffer_in_background(void)
/*===================================*/
{
time_t current_time = time(NULL);
srv_main_thread_op_info = "flushing log";
/* note: the flush interval here is srv_flush_log_at_timeout */
if (difftime(current_time, srv_last_log_flush_time)
>= srv_flush_log_at_timeout) {
log_buffer_sync_in_background(true);
srv_last_log_flush_time = current_time;
srv_log_writes_and_flush++;
}
}

Whenever the time since the last flush exceeds 1s, the flush is performed immediately.

Case 3: the master thread writes to disk

Code 1 (func: os_thread_ret_t DECLARE_THREAD(srv_master_thread)):

if (srv_check_activity(old_activity_count)) {
old_activity_count = srv_get_activity_count();
srv_master_do_active_tasks();
} else {
srv_master_do_idle_tasks();
}

The master thread flushes the redo log_sys buffer both when active and when idle.

  1. The code in srv_master_do_active_tasks:
/* Make a new checkpoint */
//
if (cur_time % SRV_MASTER_CHECKPOINT_INTERVAL == 0) {
srv_main_thread_op_info = "making checkpoint";
log_checkpoint(TRUE, FALSE);
MONITOR_INC_TIME_IN_MICRO_SECS(
MONITOR_SRV_CHECKPOINT_MICROSECOND, counter_time);
}

Its frequency is one checkpoint every 7 seconds.

  2. The code in srv_master_do_idle_tasks:
/* Make a new checkpoint */
srv_main_thread_op_info = "making checkpoint";
log_checkpoint(TRUE, FALSE);
MONITOR_INC_TIME_IN_MICRO_SECS(MONITOR_SRV_CHECKPOINT_MICROSECOND,
counter_time);

Here a checkpoint is made on every idle pass.
Also note that because commits under different parameter settings let the redo log_sys buffer keep growing, the master thread uses additional mechanisms to flush it.

Conclusion 3: the master thread periodically flushes the log to disk and makes a checkpoint.

Case 4: flushing triggered by the log_sys flag log_sys->check_flush_or_checkpoint

When this flag is set, the master thread runs the checks inside log_free_check, which can trigger flushing of both the log and dirty pages.

Notes on the redo log model

This section focuses on the model behind the redo log.

Redo log record types and their purposes

Only the log types used above are catalogued here.

[figure: table of redo log record types]

Ref

数据库内核月报 - 2015 / 05-MySQL · 引擎特性 · InnoDB redo log漫游